Prosper is a peer-to-peer lending company that offers personal loans at low rates. These loans are unsecured, which means you do not have to put up any collateral (like a house or car) that could get taken away if you can’t make payments. Each loan is typically funded by multiple people all over the United States. [LendingMemo]
I chose this dataset because I am interested in finding out how different variables affect interest rates. As a recent university graduate who hopes to start a family in the future, investigating this subject more deeply also helps me build personal finance knowledge.
I also wanted the challenge of an intermediate-level dataset, to push me to learn more about data analysis and gain experience from it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2005 2008 2012 2011 2013 2014
For LoanOriginationyear, the number of loans generally increases as the years pass. This is a good sign for the Prosper business model.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 7.000 6.592 10.000 12.000
For LoanOriginationmonth, there is no obvious trend apart from a dip in April. However, this plot alone does not give enough information to draw conclusions.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 5.933 8.000 10.000 29084
The chart of borrowers' ProsperScore shows a roughly normal distribution, with fewer borrowers at the low end (ProsperScore 1) than at the high end (ProsperScore 10).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 13.40 18.40 19.28 25.00 49.75
This plot is positively skewed and there is also a spike in BorrowerRate at 32%.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.07 137.00 755.00 7625
This plot of EmploymentStatusDuration is very positively skewed. It shows that most of the borrowers are still at an early stage of their careers or lives. This might be because they can take on more liability as leverage to pay for education or a house.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.00 36.00 36.00 40.83 36.00 60.00
This plot shows that the 36-month Term is the most popular of the three repayment terms, followed by the 60-month Term and then the 12-month Term.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.155 3.000 99.000 990
This plot of DelinquenciesLast7Years is very positively skewed. It shows that most of the borrowers are either usually punctual in their payments or still young and have not had many opportunities to build up a bad track record.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The mean DebtToIncomeRatio is 27.6%, and the distribution is positively skewed. It suggests that most of the borrowers are only willing to carry debt worth about a quarter of their income. It also indicates the level of risk that borrowers are willing to take.
## Not displayed Not employed $0 $1-24,999 $25,000-49,999
## 7741 806 621 7274 32192
## $50,000-74,999 $75,000-99,999 $100,000+
## 31050 16916 17337
The largest group of borrowers falls in the IncomeRange of $25,000-49,999, closely followed by $50,000-74,999. It shows that many of the borrowers are still at an early stage of their careers or lives. This might be because they can take on more liability as leverage to pay for education or a house.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 1627 2930 4127 23450 91852
This is a positively skewed plot. However, it only shows the ProsperPrincipalOutstanding at the time the data was extracted, so it would not be valid to draw conclusions from it alone. Other variables should be included to find out more.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
It can be observed that borrowers tend to choose a LoanOriginalAmount in multiples of $5000.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
This chart of MonthlyLoanPayment is positively skewed. But as the monthly loan payment is usually a percentage of the original principal amount, not much more can be learned from this chart.
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted Delayed
## 56576 5018 2067
## FinalPaymentInProgress
## 205
It can be seen that 56576 loans are currently active, which is the largest group and just under half of Prosper’s overall clients. For every 100 borrowers who have completed their loan payments, there are around 13 who have defaulted on their loan.
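The two ratios quoted above can be reproduced from the table with a few lines of R. This is only a sketch of the arithmetic: the counts are copied from the LoanStatus summary, and the variable names are my own.

```r
# Counts copied from the LoanStatus summary above.
loan_status <- c(Cancelled = 5, Chargedoff = 11992, Completed = 38074,
                 Current = 56576, Defaulted = 5018, Delayed = 2067,
                 FinalPaymentInProgress = 205)

# Share of all loans that are still current.
current_share <- loan_status[["Current"]] / sum(loan_status)

# Defaulted loans per 100 completed loans.
defaults_per_100 <- 100 * loan_status[["Defaulted"]] / loan_status[["Completed"]]

print(round(current_share, 3))
print(round(defaults_per_100, 1))
```

Note that this ratio counts only the Defaulted label. Including Chargedoff loans would give a noticeably higher figure.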
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04803 0.00000 39.00000
Most of the borrowers have no Recommendations from others and are still able to take up a loan.
## Employed Full-time Not employed Other Part-time
## 67322 26355 835 11408 1088
## Retired Self-employed
## 795 6134
The majority of the borrowers have jobs. This might be a requirement for them to take up a loan, or for lenders to offer them one.
## False True
## 56459 57478
The numbers of homeowners and non-homeowners are almost equal.
## False True
## 8669 105268
Most of the borrowers have their income verified.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -18.27 11.57 16.15 16.87 22.43 31.99 29084
It can be seen that most of the EstimatedEffectiveYield values lie in the positive region, while only a small number lie in the negative region.
The original dataset comprises 113937 observations of 81 variables. The variables are of class num, int and Factor.
Features of interest
It is not practical to analyse every variable in such a big dataset within the scope of this project. Thus, the variables have been narrowed down to 20.
The 20 features are:

Discrete Data (8):
- LoanOriginationyear
- LoanOriginationmonth
- Term
- IncomeRange
- LoanStatus
- EmploymentStatus
- IsBorrowerHomeowner
- IncomeVerifiable

Continuous Data (12):
- ProsperScore
- BorrowerRate
- EmploymentStatusDuration
- DelinquenciesLast7Years
- DebtToIncomeRatio
- ProsperPrincipalOutstanding
- LoanOriginalAmount
- MonthlyLoanPayment
- Recommendations
- EstimatedEffectiveYield
- LenderYield
- DifferenceInRate
The main feature of interest is the BorrowerRate.
ProsperScore is a custom risk score built using historical Prosper data. The score ranges from 1 to 10, with 10 being the best, or lowest-risk, score. It applies to loans originated after July 2009.
However, I would like to investigate how much the other variables will affect the BorrowerRate and EstimatedEffectiveYield.
Yes. The new variable that was created, and the changes made to existing variables, are described below.
DifferenceInRate: This new feature has been engineered to capture the difference between BorrowerRate and EstimatedEffectiveYield.
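A minimal sketch of how such a feature could be derived is shown here. The three-row data frame is hypothetical, and the direction of the subtraction (BorrowerRate minus EstimatedEffectiveYield) is my assumption, since the text does not state it.

```r
# Hypothetical sample in percent; the real data frame has 113937 rows.
loans <- data.frame(
  BorrowerRate            = c(15.8, 9.2, 27.5),
  EstimatedEffectiveYield = c(14.7, 7.96, NA)
)

# Assumed direction: rate paid by the borrower minus the lender's estimated yield.
# Rows with a missing yield stay NA.
loans$DifferenceInRate <- loans$BorrowerRate - loans$EstimatedEffectiveYield

print(loans)
```

Because EstimatedEffectiveYield has 29084 missing values, the derived column would inherit the same number of NA entries under this approach.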
ProsperScore: Some records have a ProsperScore of 11. This is strange, as the variable should only range from 1 to 10. Thus, values of 11 have been changed to 10, the maximum valid value.
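The adjustment can be sketched as follows, using a short hypothetical vector in place of the real column. Missing scores are left untouched.

```r
# Hypothetical ProsperScore values, including out-of-range 11s and a missing value.
prosper_score <- c(1, 5, 10, 11, NA, 11)

# Replace every 11 with 10; NA entries are skipped explicitly.
prosper_score[!is.na(prosper_score) & prosper_score == 11] <- 10

print(prosper_score)
```

`pmin(prosper_score, 10)` would give the same result here, but the explicit replacement makes it clearer that only the value 11 is being changed.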
LoanOriginationDate: This is split into LoanOriginationyear and LoanOriginationmonth to observe any trend in months or years.
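One way to do this split in base R is sketched below. The sample timestamps and their `YYYY-MM-DD HH:MM:SS` layout are assumptions about how the raw column is stored.

```r
# Hypothetical raw values; the timestamp layout is an assumption.
loan_origination_date <- c("2007-08-26 00:00:00",
                           "2014-03-03 00:00:00",
                           "2012-11-01 00:00:00")

# Keep only the date part, then parse it.
dates <- as.Date(substr(loan_origination_date, 1, 10), format = "%Y-%m-%d")

# Extract year and month as integers.
loan_origination_year  <- as.integer(format(dates, "%Y"))
loan_origination_month <- as.integer(format(dates, "%m"))

print(data.frame(loan_origination_year, loan_origination_month))
```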
EmploymentStatus: Values in this feature that are empty or “Not available” are changed to “Other”. The factor levels are also reordered to control the sequence in which the categories are displayed. This improves the plotting of charts and the analysis later.
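The recoding and reordering could look like the sketch below. The sample values are hypothetical, and the level order is only an illustration, since the order actually used in the plots is not stated.

```r
# Hypothetical raw values, including an empty string and "Not available".
employment_status <- c("Employed", "", "Not available", "Full-time", "Retired", "Other")

# Fold empty and "Not available" entries into "Other".
employment_status[employment_status %in% c("", "Not available")] <- "Other"

# Illustrative display order (an assumption, not the order used in the report).
level_order <- c("Employed", "Full-time", "Part-time", "Self-employed",
                 "Retired", "Not employed", "Other")
employment_status <- factor(employment_status, levels = level_order)

print(table(employment_status))
```

Setting `levels` explicitly keeps categories in the chosen order on plot axes instead of the default alphabetical order.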
LoanStatus: A new label, “Delayed”, has been created to represent every LoanStatus that is past its due date.
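A sketch of this relabelling is given below. It assumes that the raw past-due labels all begin with the text "Past Due", which should be checked against the actual levels of the column.

```r
# Hypothetical raw labels; the "Past Due (...)" naming is an assumption.
loan_status <- c("Current", "Past Due (1-15 days)", "Completed",
                 "Past Due (>120 days)", "Defaulted")

# Collapse every label that starts with "Past Due" into "Delayed".
loan_status[grepl("^Past Due", loan_status)] <- "Delayed"

print(table(loan_status))
```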
The data is split into categorical data and continuous data so that proper correlation plots can be made with ggpairs in the next step. The following plots let me see the correlation between all the features and visualise a rough plot of each pair before I choose the appropriate ones for deeper analysis.
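The split by column type can be sketched in base R as follows. The six-row data frame is hypothetical, and `cor()` is used here as a numeric stand-in for the correlation values that ggpairs displays. It is not the plotting code used for the report.

```r
# Hypothetical six-row sample that reuses column names from the report.
loans <- data.frame(
  BorrowerRate       = c(31.0, 24.5, 18.0, 12.5, 9.0, 6.5),
  ProsperScore       = c(2, 4, 5, 7, 9, 10),
  LoanOriginalAmount = c(2000, 4000, 7500, 10000, 15000, 25000),
  Term               = factor(c(36, 36, 60, 36, 12, 60)),
  IncomeRange        = factor(c("$1-24,999", "$25,000-49,999", "$25,000-49,999",
                                "$50,000-74,999", "$75,000-99,999", "$100,000+"))
)

# Separate numeric columns from factor columns.
is_continuous <- vapply(loans, is.numeric, logical(1))
continuous  <- loans[, is_continuous, drop = FALSE]
categorical <- loans[, !is_continuous, drop = FALSE]

# Pairwise Pearson correlations; missing values are dropped per pair.
cor_matrix <- cor(continuous, use = "pairwise.complete.obs")

# Rank the other continuous features by absolute correlation with BorrowerRate.
rate_cor <- cor_matrix["BorrowerRate", colnames(cor_matrix) != "BorrowerRate"]
rate_cor <- rate_cor[order(abs(rate_cor), decreasing = TRUE)]
print(round(rate_cor, 2))
```

Ranking by absolute value treats strong negative correlations, such as the one expected between ProsperScore and BorrowerRate, the same as strong positive ones.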
Higher correlations with BorrowerRate:
- ProsperScore
- LoanOriginalAmount
- MonthlyLoanPayment
- LenderYield and EstimatedEffectiveYield, which are unsurprisingly also highly correlated with BorrowerRate

Higher correlations with EstimatedEffectiveYield:
- ProsperScore
- LoanOriginalAmount
- MonthlyLoanPayment
- BorrowerRate and LenderYield, which are unsurprisingly also highly correlated with EstimatedEffectiveYield

From this plot, I can see that there is a certain pattern for Term and IncomeRange when they are plotted against the other variables. This made me interested in using these two variables for further investigation with the main feature of interest.
Borrowers with a higher ProsperScore tend to get a lower BorrowerRate.
It can be seen that as the LoanOriginalAmount gets larger, the spread of BorrowerRate narrows. It converges to around 10-20% at the maximum LoanOriginalAmount of $35000.
As can be seen in the boxplots, the mean of BorrowerRate becomes lower at a higher IncomeRange. This may reflect a greater ability to support the loan: with less risk of the borrower being unable to repay, the BorrowerRate is lower.
With a higher ProsperScore, the EstimatedEffectiveYield, which is the return on the invested loans, is lower because there is less risk. However, loans with a lower ProsperScore also carry a risk of a negative EstimatedEffectiveYield.
It can be seen that lenders who lend to people with an income have a higher tendency to get a negative EstimatedEffectiveYield than those who lend to people in the $0 IncomeRange.
It can be seen that as the ProsperScore increases, the LoanOriginalAmount has a bigger range. Those with a better ProsperScore tend to be able to get a higher LoanOriginalAmount. It is also interesting that the points are concentrated where LoanOriginalAmount is a multiple of $5000.
From the plot shown, it can be seen that the MonthlyLoanPayment tends to fall in a range of 2.5% to 10% of the LoanOriginalAmount.
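A possible explanation for this band is the repayment Term. The sketch below uses the standard fixed-payment amortisation formula, payment = P * i / (1 - (1 + i)^(-n)) with monthly rate i and n payments. Treating Prosper loans as following this formula is my assumption, and the function name and sample rates are hypothetical.

```r
# Standard fixed-payment amortisation formula (assumed to apply here).
# annual_rate_pct is in percent, e.g. 20 for 20%.
monthly_payment <- function(principal, annual_rate_pct, term_months) {
  i <- annual_rate_pct / 100 / 12  # monthly interest rate
  ifelse(i == 0,
         principal / term_months,
         principal * i / (1 - (1 + i)^(-term_months)))
}

# Payment as a fraction of the principal for the three Terms in the dataset.
terms <- c(12, 36, 60)
ratio_low_rate  <- monthly_payment(10000, 10, terms) / 10000
ratio_high_rate <- monthly_payment(10000, 20, terms) / 10000

print(round(100 * rbind(ratio_low_rate, ratio_high_rate), 2))
```

Under these assumptions the 60-month loans sit near the bottom of the observed band and the 12-month loans near the top, which suggests the ratio is driven largely by Term and BorrowerRate rather than by a separate choice of payment size.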
It can be seen that as the ProsperScore increases, the DebtToIncomeRatio has a smaller range. People with a lower ProsperScore are taking on more debt and putting more strain on their income.
Borrowers with a higher ProsperScore tend to get a lower BorrowerRate. It can be seen that as the LoanOriginalAmount gets larger, the spread of BorrowerRate narrows. It converges to around 10-20% BorrowerRate. As can be seen in the boxplots, the mean of BorrowerRate becomes lower at a higher IncomeRange. This may reflect a greater ability to support the loan: with less risk of the borrower being unable to repay, the BorrowerRate is lower.
With a higher ProsperScore, the EstimatedEffectiveYield, which is the return on the invested loans, is lower because there is less risk. However, loans with a lower ProsperScore also carry a risk of a negative EstimatedEffectiveYield.
It can be seen that lenders who lend to people with an income have a higher tendency to get a negative EstimatedEffectiveYield than those who lend to people in the $0 IncomeRange.
Plotting multiple features against each other in ggpairs shows the correlation between them. The highest correlation value is between LoanOriginalAmount and MonthlyLoanPayment. However, this is likely because borrowers make a fixed MonthlyLoanPayment that is proportional to the LoanOriginalAmount.
As for the strongest relationship with BorrowerRate, ProsperScore is the feature most correlated with it.
In this plot, we can see how BorrowerRate differs across the IncomeRange groups. It also shows the range of LoanOriginalAmount that the borrowers take up.
This plot shows the BorrowerRate against DebtToIncomeRatio, with a colour gradient based on ProsperScore. It can be seen from the plot that a borrower who wants a low BorrowerRate is better off with a high ProsperScore and a low DebtToIncomeRatio.
This plot shows the BorrowerRate against ProsperScore, grouped by Term. The distribution of the loans within each Term can be seen clearly here. The 36-month Term is the most popular across all ProsperScore borrower profiles.
This plot shows the BorrowerRate against the ProsperScore, grouped by IncomeRange. The plot also shows the distribution of the IncomeRange groups through the width of the boxplots.
The variables correlated with a lower BorrowerRate are LoanOriginalAmount, DebtToIncomeRatio and Term. ProsperScore is also correlated with BorrowerRate.
However, assuming that there is nothing one can do about the ProsperScore, those three variables are the things that can be changed to get a lower BorrowerRate.
In this plot, we can see how BorrowerRate differs across the IncomeRange groups. It also shows the range of LoanOriginalAmount that the borrowers take up. The scatter plot shows every point, while the boxplots show the distribution of both the LoanOriginalAmount and the BorrowerRate. It can be seen that the range of BorrowerRate is smaller for higher income groups, and the mean of the BorrowerRate is also lower for them. Looking at the widths of the boxplots, it can also be seen that the higher income groups have a wider range of LoanOriginalAmount than the lower income groups. The LoanOriginalAmount is also lower in the lower income groups. It is possible that, with less income available to support the debt, it is riskier to lend a larger LoanOriginalAmount to them than to higher income groups.
This plot shows the BorrowerRate against DebtToIncomeRatio, with a colour gradient based on ProsperScore. The region of lower BorrowerRate is dominated by very high ProsperScore values. It can also be observed that borrowers with a mid-range ProsperScore tend to get a mid-range BorrowerRate and also take on a higher amount of debt relative to their income. This can be seen in the plot as a divergence towards the right as the BorrowerRate gets higher, with the points nearer to the right representing mid-range ProsperScore values.
This plot shows the BorrowerRate against ProsperScore, grouped by Term. The distribution of the loans within each Term can be seen clearly here. The 36-month Term is the most popular across all ProsperScore values, and its points are well distributed across BorrowerRate and ProsperScore. It can be seen that borrowers who choose the 12-month Term have a lower mean BorrowerRate than those on the longer Terms. The 60-month Term has a smaller range of BorrowerRate, but its mean is just slightly higher than that of the 36-month Term. From this plot, someone who is looking for a lower BorrowerRate has a higher chance of getting one by choosing a shorter repayment Term.
From the analysis of the many plots done in this project, I can say that some of the variables are correlated with getting a lower BorrowerRate. The variables are LoanOriginalAmount, DebtToIncomeRatio and Term. It is true that ProsperScore is also correlated with BorrowerRate. However, assuming that there is nothing one can do about the ProsperScore, those three variables are the things that can be changed to get a lower BorrowerRate.
As for EstimatedEffectiveYield, it is strongly correlated with BorrowerRate. So a lender who wants to increase the EstimatedEffectiveYield might be interested in taking on more risk and looking for borrowers who have a higher LoanOriginalAmount, a higher DebtToIncomeRatio and a longer Term.
Where did I run into difficulties in the analysis?
This dataset is rather large for a beginner in R and in data analysis. Although there are 81 variables in this dataset, I was able to choose the few that I am interested in and explore them through univariate and bivariate analysis before moving on to multivariate analysis.
I had some difficulties with the different parameters used in plotting the graphs, and with thinking about how best to arrange the data for optimal comprehension.
Where did I find successes?
Overall, I am satisfied with the skills that I have learnt while doing this project. I appreciate being able to visualise multiple variables and analyse them to see patterns unfold.
How could the analysis be enriched in future work (e.g. additional data and analyses)?
Further analysis can certainly be done, but it would be too large and time-consuming for just one project. One idea is to analyse the ProsperScore and find out which variables determine the score, so that someone who is interested in getting a loan at a better BorrowerRate can work on improving their ProsperScore.